This chapter explores deep learning approaches to time series forecasting, comparing modern neural network architectures with traditional statistical methods. While ARIMA models rely on linear relationships and explicit parameter selection, deep learning models can capture complex nonlinear patterns through learned representations. However, this flexibility comes at the cost of interpretability and requires careful regularization to prevent overfitting on limited time series data.
Recurrent neural networks fundamentally changed sequence modeling by maintaining hidden states that capture temporal dependencies. Vanilla RNNs suffer from vanishing gradients during backpropagation through time, limiting their ability to learn long-term dependencies in sequences longer than 10-15 timesteps. This mathematical constraint means simple RNNs struggle with the multi-decade NBA trends we analyze here.
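A toy calculation makes the vanishing-gradient constraint concrete. In backpropagation through time, the gradient reaching timestep zero is scaled by a per-step factor of roughly the recurrent weight times the activation derivative; the scalar values below are illustrative assumptions, not fitted quantities:

```python
# Toy illustration of vanishing gradients in a scalar RNN: each step of
# backpropagation through time multiplies the gradient by ~ w * tanh'(h).
# With |w * tanh'(h)| < 1, the product shrinks geometrically.
w = 0.9            # recurrent weight (illustrative)
tanh_deriv = 0.7   # typical tanh derivative magnitude (assumed)

def gradient_scale(timesteps):
    """Product of per-step gradient factors over a sequence."""
    return (w * tanh_deriv) ** timesteps

for t in [5, 15, 50]:
    print(f"{t:3d} steps: gradient scaled by {gradient_scale(t):.2e}")
```

By 15 steps the factor is already below 1e-3, which is why dependencies beyond 10-15 timesteps are hard for a vanilla RNN to learn.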
Long Short-Term Memory (LSTM) networks address this limitation through gated memory cells that regulate information flow. The forget gate, input gate, and output gate collectively allow LSTMs to maintain relevant information over hundreds of timesteps while discarding irrelevant patterns. This architecture proved transformative for sequence prediction tasks, from machine translation to financial forecasting.
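The gate interactions can be written compactly in standard LSTM notation, where $\sigma$ is the logistic sigmoid, $\odot$ the elementwise product, and $[h_{t-1}, x_t]$ the concatenated previous hidden state and current input:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive cell-state update is what lets gradients survive long horizons: when $f_t \approx 1$, $c_{t-1}$ passes through nearly unchanged instead of being repeatedly squashed.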
Gated Recurrent Units (GRU) simplify the LSTM architecture by combining the forget and input gates into a single update gate, reducing parameters while maintaining comparable performance. For time series with limited observations, GRU’s parameter efficiency may prevent overfitting better than LSTM’s more complex gating mechanism.
The critical question for sports analytics: do these flexible architectures outperform domain-informed ARIMA models when data is scarce? Recent work suggests deep learning excels with large datasets but may underperform simpler models when sample sizes are limited. Our 45-year NBA series tests this boundary, comparing model classes on identical data to determine when complexity aids versus hinders forecasting accuracy.
Time series: 45 observations
Range: 102.22 to 115.28
First 5 values:
Season ORtg
0 1981 105.500000
1 1982 106.883333
2 1983 104.687500
3 1984 107.608333
4 1985 107.870833
Last 5 values:
Season ORtg
40 2021 112.351613
41 2022 111.974194
42 2023 114.806452
43 2024 115.283871
44 2025 114.532258
Observation: ORtg shows a clear upward trend from ~105 in 1981 to ~115 in 2025, reflecting the league’s offensive evolution. The series is non-stationary with low variance, making it challenging to model formally but easy to interpret.
Epoch 25: early stopping
Restoring model weights from the end of the best epoch: 5.
Training stopped at epoch 25
Best validation loss: 0.201296
Training Observations: The training loss decreases steadily while the validation loss bottoms out early (best epoch 5 in this run). Early stopping prevents overfitting by halting training and restoring the weights from the epoch with the lowest validation loss.
```python
# Make predictions
rnn_train_pred = rnn_model.predict(X_train, verbose=0)
rnn_val_pred = rnn_model.predict(X_val, verbose=0)
rnn_test_pred = rnn_model.predict(X_test, verbose=0)

# Inverse transform predictions
rnn_train_pred_orig = scaler.inverse_transform(rnn_train_pred)
rnn_val_pred_orig = scaler.inverse_transform(rnn_val_pred)
rnn_test_pred_orig = scaler.inverse_transform(rnn_test_pred)
y_train_orig = scaler.inverse_transform(y_train)
y_val_orig = scaler.inverse_transform(y_val)
y_test_orig = scaler.inverse_transform(y_test)

# Calculate RMSE
rnn_train_rmse = np.sqrt(mean_squared_error(y_train_orig, rnn_train_pred_orig))
rnn_val_rmse = np.sqrt(mean_squared_error(y_val_orig, rnn_val_pred_orig))
rnn_test_rmse = np.sqrt(mean_squared_error(y_test_orig, rnn_test_pred_orig))

print("RNN Performance:")
print(f"  Training RMSE: {rnn_train_rmse:.4f}")
print(f"  Validation RMSE: {rnn_val_rmse:.4f}")
print(f"  Test RMSE: {rnn_test_rmse:.4f}")

# Visualize predictions
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Training predictions
axes[0].plot(y_train_orig, label='Actual', marker='o', alpha=0.7)
axes[0].plot(rnn_train_pred_orig, label='Predicted', marker='s', alpha=0.7)
axes[0].set_title(f'RNN Training Set (RMSE: {rnn_train_rmse:.3f})')
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('ORtg')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Validation predictions
axes[1].plot(y_val_orig, label='Actual', marker='o', alpha=0.7)
axes[1].plot(rnn_val_pred_orig, label='Predicted', marker='s', alpha=0.7)
axes[1].set_title(f'RNN Validation Set (RMSE: {rnn_val_rmse:.3f})')
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('ORtg')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Test predictions
axes[2].plot(y_test_orig, label='Actual', marker='o', alpha=0.7)
axes[2].plot(rnn_test_pred_orig, label='Predicted', marker='s', alpha=0.7)
axes[2].set_title(f'RNN Test Set (RMSE: {rnn_test_rmse:.3f})')
axes[2].set_xlabel('Sample')
axes[2].set_ylabel('ORtg')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
RNN Performance:
Training RMSE: 1.3826
Validation RMSE: 2.5862
Test RMSE: 5.2847
Architecture: Identical to RNN except GRU layer replaces SimpleRNN. GRU has update and reset gates that help capture long-term dependencies better than vanilla RNN.
Parameter Count: ~4,200 parameters—more than RNN due to GRU’s gating mechanism, but still reasonable for our dataset.
Architecture: LSTM layer with 32 units replaces RNN/GRU. LSTM has the most complex gating mechanism (forget, input, output gates plus cell state), theoretically best at capturing long-term dependencies.
Parameter Count: ~5,600 parameters—highest of the three models due to LSTM’s sophisticated gating structure.
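The approximate totals quoted above presumably include the dense output head and any additional layers; for the recurrent layer alone, the standard formulas already show the RNN < GRU < LSTM ordering. A sketch for 32 units and one input feature (the GRU formula follows Keras’s default reset_after=True convention):

```python
def rnn_params(units, features):
    # SimpleRNN: kernel + recurrent kernel + bias
    return units * (features + units + 1)

def gru_params(units, features):
    # GRU (Keras default reset_after=True): 3 gates, each with a kernel,
    # a recurrent kernel, and two bias vectors
    return 3 * (units * (features + units) + 2 * units)

def lstm_params(units, features):
    # LSTM: 4 gate/candidate blocks, each with a kernel, a recurrent
    # kernel, and one bias vector
    return 4 * (units * (features + units) + units)

for name, fn in [("SimpleRNN", rnn_params), ("GRU", gru_params), ("LSTM", lstm_params)]:
    print(f"{name}: {fn(32, 1)} parameters")
```

This gives 1,088 / 3,360 / 4,352 parameters for the recurrent layers themselves; the difference from the chapter’s totals would come from the layers around them.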
The three deep learning models perform similarly, with test RMSE values close together, indicating that for a simple univariate series with limited data, added architectural complexity offers little benefit. Regularization played a critical role: early stopping, dropout, and L2 weight decay kept the training and validation curves tight and prevented overfitting, with every model halting well before the full 200-epoch budget. Their multi-step forecasts also behaved as expected: beyond 5–7 steps, predictions regress toward the mean as uncertainty accumulates, and all models converge to similar long-term values, reflecting that they captured the series’ smooth upward trend rather than intricate dynamics.
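The regression toward the mean in iterated forecasts can be seen in a minimal AR(1) toy (coefficients assumed for illustration, not fitted to the NBA series): each recursive step pulls the prediction a fixed fraction of the way back toward the long-run mean.

```python
# Toy AR(1): y_t = c + phi * y_{t-1}. Feeding each prediction back in as
# the next input drives the forecast toward the long-run mean c / (1 - phi).
c, phi = 2.0, 0.8
long_run_mean = c / (1 - phi)   # = 10.0

y = 14.0                        # last observed value, above the mean
path = []
for step in range(20):
    y = c + phi * y             # recursive one-step-ahead forecast
    path.append(y)

print(f"long-run mean: {long_run_mean}")
print(f"after 1 step:   {path[0]:.3f}")
print(f"after 7 steps:  {path[6]:.3f}")
print(f"after 20 steps: {path[19]:.3f}")
```

The gap to the mean shrinks by the factor phi each step, so by 20 steps the forecast is essentially flat, mirroring what the neural models do at long horizons.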
Compared with traditional ARIMA, which achieved test RMSE around 0.8–1.2 depending on the forecast window, the deep learning models are competitive on training and validation data but lag on the held-out test split (the univariate RNN’s test RMSE is 5.28). Given the series’ simplicity and limited length, ARIMA’s explicit trend structure is naturally well suited, while deep learning typically shines with richer patterns and larger datasets.
Forecasting Performance Reflection
Across the traditional models (ARIMA, SARIMA) and the deep learning models (RNN, GRU, LSTM), the statistical methods post test RMSE values of roughly 0.8–1.2 while the neural models land noticeably higher on the short test split, so no deep architecture earns its added complexity for NBA offensive rating. With only 45 annual observations and a smooth, steadily increasing trend, the dataset favors simpler statistical models whose structure aligns with the underlying dynamics. ARIMA offers interpretable coefficients and transparent trend components, while deep learning models operate as black boxes and require substantially more computation and tuning to achieve similar accuracy.
Ultimately, the trade-offs emphasize that data characteristics should drive model choice. ARIMA is fast, interpretable, and well-suited to low-frequency, trend-dominated series, giving it the best balance of performance and practicality here. Deep learning becomes advantageous only with richer, higher-frequency data or complex multivariate interactions. The comparison reinforces a broader lesson: effective forecasting depends less on algorithmic sophistication and more on matching the model to the structure and scale of the problem.
Multivariate Forecasting
We now incorporate multiple NBA metrics to capture relationships between pace, shooting, and efficiency.
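Before any model sees the data, the three series have to be reshaped into supervised windows. A minimal sketch, assuming a (seasons × features) array with columns [ORtg, Pace, 3PAr] and a five-season lookback (the helper name and window size are illustrative, not the chapter’s exact code):

```python
import numpy as np

def make_windows(series, lookback):
    """Return (X, y): X has shape (n, lookback, n_features), y the next row."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])   # lookback consecutive seasons
        y.append(series[i + lookback])     # the season that follows them
    return np.array(X), np.array(y)

# Stand-in for the 45 x 3 array of NBA metrics
data = np.random.default_rng(0).normal(size=(45, 3))
X, y = make_windows(data, lookback=5)
print(X.shape, y.shape)  # (40, 5, 3) (40, 3)
```

Each recurrent model then predicts all three next-season values jointly, which is why the test RMSEs below are reported per metric and as an average.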
Epoch 43: early stopping
Restoring model weights from the end of the best epoch: 23.
Multivariate RNN Test Performance:
ORtg RMSE: 5.1064
Pace RMSE: 3.2807
3PAr RMSE: 0.1520
Average: 2.8464
Epoch 62: early stopping
Restoring model weights from the end of the best epoch: 42.
Multivariate GRU Test Performance:
ORtg RMSE: 5.5745
Pace RMSE: 3.6629
3PAr RMSE: 0.0565
Average: 3.0980
Epoch 29: early stopping
Restoring model weights from the end of the best epoch: 9.
Multivariate LSTM Test Performance:
ORtg RMSE: 6.3031
Pace RMSE: 3.0568
3PAr RMSE: 0.1021
Average: 3.1540
Traditional Multivariate Model: VAR
For comparison, we fit a Vector AutoRegression (VAR) model.
```python
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller

# Prepare data for VAR (requires stationarity)
var_data = multivar_data[['ORtg', 'Pace', '3PAr']].copy()

# Create proper datetime index (annual frequency starting from first season)
start_year = int(league_avg['Season'].min())
var_data.index = pd.date_range(start=f'{start_year}-01-01', periods=len(var_data), freq='YS')

# Check stationarity
for col in var_data.columns:
    adf_result = adfuller(var_data[col])
    print(f"{col}: ADF p-value = {adf_result[1]:.4f}",
          "(stationary)" if adf_result[1] < 0.05 else "(non-stationary)")

# Difference if needed
var_data_diff = var_data.diff().dropna()
print("\nAfter differencing:")
for col in var_data_diff.columns:
    adf_result = adfuller(var_data_diff[col])
    print(f"{col}: ADF p-value = {adf_result[1]:.4f}")

# Split (use same indices as the deep learning split)
var_train = var_data_diff.iloc[:train_size - 1]
var_test = var_data_diff.iloc[train_size - 1:]

# Fit VAR (warning suppressed by proper datetime index)
var_model = VAR(var_train)
var_results = var_model.fit(maxlags=5, ic='aic')
print("\nVAR Model Summary:")
print(f"Selected lag order: {var_results.k_ar}")
print(var_results.summary())

# Forecast
var_forecast = var_results.forecast(var_train.values[-var_results.k_ar:], steps=len(var_test))
var_forecast_df = pd.DataFrame(var_forecast, columns=var_data_diff.columns)

# Calculate RMSE on differenced data
var_rmse_ortg = np.sqrt(mean_squared_error(var_test['ORtg'], var_forecast_df['ORtg']))
var_rmse_pace = np.sqrt(mean_squared_error(var_test['Pace'], var_forecast_df['Pace']))
var_rmse_3par = np.sqrt(mean_squared_error(var_test['3PAr'], var_forecast_df['3PAr']))
var_rmse_avg = np.mean([var_rmse_ortg, var_rmse_pace, var_rmse_3par])

print("\nVAR Test Performance (on differenced data):")
print(f"  ORtg RMSE: {var_rmse_ortg:.4f}")
print(f"  Pace RMSE: {var_rmse_pace:.4f}")
print(f"  3PAr RMSE: {var_rmse_3par:.4f}")
print(f"  Average: {var_rmse_avg:.4f}")
```
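One caveat worth making explicit: the VAR RMSEs are computed on differenced data, while the deep learning RMSEs are on original levels, so the two are not directly comparable. Forecast differences can be mapped back to levels by cumulative summation from the last observed level; a toy sketch with illustrative values (real code would use var_data and var_forecast_df from above):

```python
import pandas as pd

# Last observed levels and a 3-step forecast of differences (toy values)
last_level = pd.Series({'ORtg': 112.0, 'Pace': 99.0, '3PAr': 0.40})
diff_forecast = pd.DataFrame({
    'ORtg': [0.5, 0.3, -0.1],
    'Pace': [0.2, 0.1, 0.0],
    '3PAr': [0.01, 0.01, 0.00],
})

# levels_t = last observed level + cumulative sum of forecast differences
level_forecast = diff_forecast.cumsum() + last_level
print(level_forecast)
```

After this inversion, RMSE on the level forecasts would be on the same scale as the deep learning numbers.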
The comprehensive comparison reveals that deep learning models (RNN, GRU, LSTM) achieve competitive but not superior performance compared to traditional ARIMA/VAR methods, challenging the assumption that sophisticated neural architectures automatically improve predictions. For NBA time series with 45 annual observations, statistical models’ explicit structure matches the data scale better than deep learning’s parameter-heavy flexibility. Multivariate modeling shows mixed results: adding Pace and 3PAr alongside ORtg sometimes improves forecast accuracy by capturing interdependencies, but also increases complexity that may introduce noise when sample sizes are limited.
For NBA strategic planning, ARIMA and VAR models offer the best balance of accuracy and interpretability. These traditional methods provide transparent coefficients, statistical confidence intervals, and reliable convergence without extensive hyperparameter tuning, which matters when explaining forecasts to team executives making personnel decisions. Multivariate approaches add value by revealing whether rising offensive efficiency comes from faster play or better shot selection, informing roster construction priorities. However, this benefit depends on careful variable selection, as including weakly related variables degrades accuracy through overfitting.
Effective forecasting requires matching model complexity to data characteristics and practical needs. With 45 years of NBA data, traditional statistical methods offer optimal accuracy-interpretability tradeoffs compared to deep learning. Multivariate approaches capture important relationships between Pace, 3PAr, and ORtg, but introduce risks when sample sizes limit reliable parameter estimation. Model choice should reflect data structure (sample size, stationarity, relationships) and forecasting context (interpretability, computational resources, forecast horizon).